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Abstract 


Ensemble  prediction  systems  typically  show  positive  spread-error  correlation,  but 
they  are  subject  to  forecast  bias  and  under  dispersion,  and  therefore  uncalibrated.  This 
work  proposes  the  use  of  ensemble  model  output  statistics  (EMOS),  an  easy  to  imple¬ 
ment  post-processing  technique  that  addresses  both  forecast  bias  and  underdispersion 
and  takes  account  of  the  spread-skill  relationship.  The  technique  is  based  on  multiple  lin¬ 
ear  regression  and  akin  to  the  superensemble  approach  that  has  traditionally  been  used 
for  deterministic-style  forecasts.  The  EMOS  technique  yields  probabilistic  forecasts  that 
take  the  form  of  Gaussian  predictive  probability  density  functions  (PDFs)  for  continuous 
weather  variables,  and  can  be  applied  to  gridded  model  output.  The  EMOS  predictive 
mean  is  an  optimal,  bias-corrected  weighted  average  of  the  ensemble  member  forecasts, 
with  coefficients  that  are  constrained  to  be  nonnegative  and  associated  with  the  member 
model  skill.  The  EMOS  predictive  mean  provides  a  highly  accurate  deterministic-style 
forecast.  The  EMOS  predictive  variance  is  a  linear  function  of  the  ensemble  spread.  For 
fitting  the  EMOS  coefficients,  the  method  of  minimum  CRPS  estimation  is  introduced. 
The  minimum  CRPS  estimator  finds  the  coefficient  values  that  optimize  the  contin¬ 
uous  ranked  probability  score  (CRPS)  for  the  training  data.  The  EMOS  technique 
was  applied  to  48-hour  forecasts  of  sea  level  pressure  and  surface  temperature  over  the 
North  American  Pacific  Northwest  in  Spring  2000,  using  the  University  of  Washington 
mesoscale  ensemble.  When  compared  to  the  bias-corrected  ensemble,  deterministic-style 
EMOS  forecasts  of  sea  level  pressure  had  root-mean-square  error  9%  less  and  mean  ab¬ 
solute  error  8%  less.  The  EMOS  predictive  PDFs  were  much  better  calibrated  than  the 
raw  ensemble  or  the  bias-corrected  ensemble,  and  they  were  sharp  in  that  prediction 
intervals  were  considerably  shorter  on  average  than  those  obtained  from  climatological 
forecasts.  Perhaps  surprisingly,  the  EMOS  ensemble  was  frequently  sharper  than  the 
raw  ensemble.  When  compared  to  the  bias-corrected  ensemble,  EMOS  improved  the 
continuous  ranked  probability  score  by  16%.  It  also  improved  the  ignorance  score  by 
3.7,  corresponding  to  the  predictive  PDF  at  the  verifying  observation  being  greater  by 
a  factor  of  40. 
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1  Introduction 


During  the  past  decade,  the  use  of  forecast  ensembles  for  assessing  the  uncertainty 
of  numerical  weather  predictions  has  become  routine.  Three  operational  methods  for 
the  generation  of  synoptic-scale  ensembles  have  been  developed:  The  breeding  growing 
modes  method  used  by  the  US  National  Centers  for  Environmental  Prediction  (Toth  and 
Kalnay  1993),  the  singular  vector  method  used  by  the  European  Centre  for  Medium- 
Range  Weather  Forecasts  (Molteni  et  al.  1996),  and  the  perturbed  observations  method 
used  by  the  Canadian  Meteorological  Centre  (Hontekamer  et  ah  1996).  More  recently, 
mesoscale  short-range  ensembles  have  been  developed,  such  as  the  University  of  Wash¬ 
ington  ensemble  system  over  the  North  American  Pacific  Northwest  (Grimit  and  Mass 
2002;  Eckel  2003).  The  ability  of  ensemble  systems  to  improve  deterministic-style  fore¬ 
casts  and  to  predict  forecast  skill  has  been  convincingly  established.  Statistically  signif¬ 
icant  spread-error  correlations  suggest  that  ensemble  variance  and  related  measures  of 
ensemble  spread  are  skillful  indicators  of  the  accuracy  of  the  ensemble  mean  forecast. 

Case  studies  in  probabilistic  weather  forecasting  have  typically  focused  on  the  predic¬ 
tion  of  categorical  events.  Ensembles  also  allow  for  probabilistic  forecasts  of  continuous 
weather  variables,  such  as  air  pressure  and  temperature,  that  are  expressed  in  terms 
of  predictive  probability  density  functions  (PDFs)  or  predictive  cumulative  distribution 
functions  (CDFs).  Due  to  the  limited  size  of  current  ensemble  systems,  which  typically 
consist  of  five  to  fifty  ensemble  member  forecasts,  raw  ensemble  output  does  not  pro¬ 
vide  predictive  PDFs,  and  some  form  of  post-processing  is  required  (Richardson  2001). 
However,  various  challenges  in  the  statistical  post-processing  of  ensemble  output  have 
been  described.  Systematic  biases  are  substantial  in  current  modeling  systems  (Atger 
2003;  Mass  2003)  and  might  disguise  probabilistic  forecast  skill.  Furthermore,  forecast 
ensembles  are  typically  underdispersive  (Hamill  and  Colucci  1997;  Eckel  and  Walters 
1998). 

In  this  paper,  we  propose  the  use  of  ensemble  model  output  statistics  (EMOS),  an 
easy  to  implement  statistical  post-processing  technique  that  addresses  the  aforemen¬ 
tioned  issues.  Our  method  is  a  variant  of  multiple  linear  regression  or  model  output 
statistics  (MOS)  techniques  that  have  traditionally  been  used  for  deterministic-style  and 
probability  of  precipitation  forecasts  (Glahn  and  Lowry  1972;  Wilks  1995).  Specifically, 
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suppose  that  Xi, . . . ,  Xrn  denotes  an  ensemble  of  individually  distinguishable  forecasts 
for  a  univariate  weather  quantity  Y.  A  multiple  linear  regression  equation  for  Y  in  terms 
of  the  ensemble  member  forecasts  can  be  written  as 

Y  —  cl  +  b\X\  +  •  •  •  +  bmXm  +  e  (1) 

where  a  and  are  regression  coefficients,  and  where  e  is  an  error  term  that 

averages  to  zero.  Regression  approaches  of  this  type  have  been  shown  to  improve  the 
deterministic-style  forecast  accuracy  of  synoptic  weather  and  seasonal  climate  ensembles 
(Krishnamurti  et  al.  1999,  2000;  Kharin  and  Zwiers  2002),  and  the  associated  forecast 
systems  have  been  referred  to  as  superensembles. 

The  use  of  regression  techniques  for  probabilistic  forecasting  has  not  received  much 
attention  in  the  literature,  with  the  exception  of  forecasts  of  binary  events  (Glahn  and 
Lowry  1972;  Stefanova  and  Krishnamurti  2002).  In  this  work,  we  obtain  full  predictive 
PDFs  and  CDFs  from  ensemble  forecasts  of  a  continuous  weather  variable.  Standard 
regression  theory  suggests  a  straightforward  way  of  constructing  predictive  PDFs  and 
CDFs  from  a  regression  equation,  by  taking  them  to  be  Gaussian  with  predictive  mean 
equal  to  the  regression  estimate,  and  predictive  variance  equal  to  the  mean  squared 
prediction  error  for  the  training  data.  This  approach  corrects  for  model  biases  and  takes 
account  of  underdispersion.  However,  the  resulting  assessment  of  uncertainty  is  static, 
in  that  the  predictive  variance  is  independent  of  the  ensemble  spread,  thereby  negating 
the  spread-skill  relationship  (Whitaker  and  Loughe  1998).  Hence,  we  model  the  variance 
of  the  error  term  in  the  multiple  linear  regression  equation  (1)  as  a  linear  function  of 
the  ensemble  spread,  that  is, 

Var(e)  =  c  +  dS 2  (2) 

where  S 2  is  the  ensemble  variance,  and  where  c  and  d  are  nonnegative  coefficients. 
Combining  (1)  and  (2)  yields  the  Gaussian  predictive  distribution 

J\f  (a  +  b\X\  +  •  •  •  +  bmXm,  c  +  dS2^j 

whose  mean  equals  the  regression  estimate  and  forms  a  bias-corrected  weighted  average 
of  the  ensemble  member  forecasts,  and  whose  variance  depends  linearly  on  the  ensemble 
spread.  Negative  regression  weights  can,  and  frequently  do,  occur  in  this  type  of  formu¬ 
lation,  as  in  Tables  2,  4,  5,  and  6  of  van  den  Dool  and  Rukhovets  (1994).  This  is  an 
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Table  1:  Phase  I  of  the  University  of  Washington  mesoscale  short-range  ensemble, 
January-June  2000.  Initial  conditions  (ICs)  and  lateral  boundary  conditions  (LBCs) 
were  obtained  from  the  Aviation  Model  (AVN),  the  Nested  Grid  Model  (NGM)  Re¬ 
gional  Data  Assimilation  System,  and  the  Eta  Data  Assimilation  System,  all  run  by  the 
US  National  Centers  for  Environmental  Prediction  (NCEP),  the  Global  Environmental 
Multiscale  (GEM)  analysis  run  by  the  Canadian  Meteorological  Centre  (CMC),  and  the 
US  Navy  Operational  Global  Atmospheric  Prediction  System  (NOGAPS)  analysis  run 
by  Fleet  Numerical  Meteorology  and  Oceanography  Center  (FNMOC).  See  Grimit  and 
Mass  (2002)  for  details. 


No. 

Ensemble  Member 

IC/LBC  Source 

1 

AVN-MM5 

NCEP 

2 

GEM-MM5 

CMC 

3 

ETA-MM5 

NCEP 

4 

NGM-MM5 

NCEP 

5 

NOGAP  S-MM5 

FNMOC 

artifact  caused  by  the  collinearity  of  the  ensemble  member  forecasts,  and  the  negative 
weights  seem  hard  to  interpret.  They  imply,  all  else  equal,  that  sea  level  pressure,  say,  is 
predicted  to  be  lower  when  the  forecast  with  the  negative  weight  is  higher.  To  address 
this  issue,  we  estimate  the  statistical  model  under  the  constraint  that  the  coefficients 
bi, ,  bm,  as  well  as  c  and  d,  are  nonnegative.  We  refer  to  the  resulting  predictive  PDFs 
and  CDFs  as  ensemble  model  output  statistics  or  EMOS  forecasts. 

We  applied  the  EMOS  technique  to  the  University  of  Washington  mesoscale  short- 
range  ensemble  described  by  Grimit  and  Mass  (2002).  Briefly,  this  is  a  multi-analysis, 
single-model  (MM5)  ensemble  driven  by  initial  conditions  and  lateral  boundary  con¬ 
ditions  obtained  from  major  operational  weather  centers  worldwide.  Table  1  provides 
an  overview  of  the  Phase  I  University  of  Washington  ensemble  system.  Figure  1  illus¬ 
trates  the  spread-skill  relationship  for  sea  level  pressure  forecasts,  using  the  same  period 
January  -  June  2000  on  which  the  study  of  Grimit  and  Mass  (2002)  was  based.  The 
ensemble  spread  provides  useful  information  about  the  error  of  the  ensemble  mean  fore¬ 
cast.  Figure  2  gives  an  example  of  a  48-hour  EMOS  forecast  of  sea  level  pressure.  This 
forecast  was  initialized  0000  UTC  25  May  2000  and  was  valid  at  Hope  Airport,  British 
Columbia.  Both  the  EMOS  predictive  PDF  and  the  EMOS  predictive  CDF  are  shown. 
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Spread-Skill  Relationship  for  Sea-Level  Pressure  Forecasts 


1  23456789  10 

Decile  of  Ensemble  Variance 


Figure  1:  Spread-skill  relationship  for  ensemble  mean  forecasts  of  sea  level  pressure  over 
the  Pacific  Northwest,  January  -  June  2000.  For  each  decile  of  the  ensemble  variance, 
the  boxplots  show  the  10%,  25%,  50%,  75%,  and  90%  percentile  of  the  squared  forecast 
error.  The  correlation  coefficient  between  the  ensemble  spread  and  the  squared  forecast 
error  was  0.33  for  individual  forecasts,  and  0.52  for  daily  averages  aggregated  across  the 
Pacific  Northwest. 

The  construction  of  prediction  intervals  from  the  predictive  CDF,  say  F,  is  straightfor¬ 
ward.  For  instance,  F(^)  and  F( |)  form  the  lower  and  upper  endpoint  of  the  66|% 
central  prediction  interval,  respectively.  In  the  Hope  Airport  example,  and  using  the 
millibar  as  unit,  this  interval  was  [1008.3,  1013.0].  The  ensemble  range  of  the  University 
of  Washington  ensemble  was  [1003.7,  1016.8].  For  a  five-member  ensemble,  this  is  also 
a  nominal  66|%  prediction  interval,  but  is  much  wider.  Perhaps  surprisingly,  this  sit¬ 
uation  -  EMOS  prediction  intervals  that  are  shorter  than  their  ensemble  counterparts 
-  was  not  uncommon.  In  our  case  study  this  occurred  in  about  27%  of  the  sea  level 
pressure  forecasts. 

The  paper  is  organized  as  follows.  In  Section  2  we  describe  the  EMOS  technique  in 
detail,  and  we  explain  how  we  go  about  verifying  probabilistic  forecasts.  In  assessing 
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(a)  Predictive  PDF 


(b)  Predictive  CDF 


Figure  2:  EMOS  48-hour  forecast  of  sea  level  pressure  at  Hope  Airport,  British 
Columbia,  initialized  0000  UTC  25  May  2000:  (a)  Predictive  PDF.  (b)  Predictive  CDF. 
Also  shown  are  the  five  ensemble  member  forecasts  (solid  and  broken  lines)  and  the 
verifying  observation  (dotted  line).  The  ETA-MM5  and  NGM-MM5  forecasts  (broken 
lines)  had  zero  EMOS  weight.  The  unit  used  is  the  millibar  (mb). 

forecast  PDFs,  we  are  guided  by  the  principle  that  probabilistic  forecasts  strive  to  max¬ 
imize  sharpness  subject  to  calibration  (Gneiting  et  al.  2003).  We  apply  diagnostic  tools, 
such  as  the  verification  rank  histogram  and  the  probability  integral  transform  (PIT) 
histogram,  as  well  as  scoring  rules,  among  them  the  continuous  ranked  probability  score 
(CRPS)  and  ignorance  score.  For  estimating  the  EMOS  coefficients,  we  introduce  the 
novel  approach  of  minimum  CRPS  estimation,  which  forms  a  particular  case  of  mini¬ 
mum  contrast  estimation.  Specifically,  we  find  the  coefficient  values  that  minimize  the 
continuous  ranked  probability  score  for  the  training  data.  For  EMOS,  this  method  gives 
better  results  than  classical  maximum  likelihood  estimation,  which  is  nonrobust  and 
favors  overdispersive  forecast  PDFs. 

Section  3  provides  a  case  study  of  EMOS  forecasts  for  sea  level  pressure  and  surface 
temperature  in  Spring  2000  over  the  Pacific  Northwest,  using  the  University  of  Washing¬ 
ton  ensemble.  We  explain  how  we  find  a  suitable  training  period,  and  we  describe  and 
verify  the  EMOS  forecasts.  The  EMOS  forecast  PDFs  were  much  better  calibrated  than 
the  raw  ensemble  or  the  bias-corrected  ensemble,  and  the  mean  absolute  error  (MAE), 
root-mean-square  error  (RMSE),  continuous  ranked  probability  score  (CRPS),  and  ig- 
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norance  score  (IGN)  for  the  EMOS  forecasts  were  consistently,  and  substantially,  better 
than  the  corresponding  quantities  for  the  raw  ensemble  or  the  bias-corrected  ensemble. 
The  paper  closes  with  a  discussion  in  Section  4. 

2  Methods 

We  now  explain  our  approach  to  verifying  probabilistic  forecasts,  and  we  describe  the 
ensemble  model  output  statistics  (EMOS)  technique  in  detail.  For  estimating  the  EMOS 
coefficients  we  use  the  novel  approach  of  minimum  CRPS  estimation,  which  forms  a 
special  case  of  minimum  contrast  estimation  (MCE).  This  method  is  best  explained  in 
terms  of  verification  measures,  so  we  describe  these  Erst. 

2.1  Assessing  sharpness  and  calibration 

The  goal  of  probabilistic  forecasting  is  to  maximize  the  sharpness  of  the  forecast  PDFs 
subject  to  calibration  (Gneiting  et  al.  2003).  Calibration  refers  to  the  statistical  con¬ 
sistency  between  the  forecast  PDFs  and  the  verifications,  and  is  a  joint  property  of  the 
predictions  and  the  verifications.  Briefly,  a  forecast  technique  is  calibrated  if  meteoro¬ 
logical  events  declared  to  have  probability  p  occur  a  proportion  p  of  the  time  on  average. 
Sharpness  refers  to  the  spread  of  the  forecast  PDFs  and  is  a  property  of  the  predictions 
only.  A  forecast  technique  is  sharp  if  prediction  intervals  are  shorter  on  average  than 
prediction  intervals  derived  from  naive  methods,  such  as  climatology  or  persistence.  The 
more  concentrated  the  forecast  PDFs  are,  the  sharper  the  forecast,  and  the  sharper  the 
better,  subject  to  calibration. 

The  principal  tool  for  assessing  the  calibration  of  ensemble  forecasts  is  the  verifica- 
tion  rank  histogram  or  Talagrand  diagram  (Anderson  1996;  Harnill  and  Colucci  1997; 
Talagrand  et  al.  1997;  Harnill  2001).  To  obtain  a  verification  rank  histogram,  find  the 
rank  of  the  verifying  observation  when  pooled  within  the  ordered  ensemble  values,  and 
plot  the  histogram  of  the  ranks. 

The  analogous  tool  for  PDF  forecasts  is  the  probability  integral  transform  (PIT) 
histogram.  If  F  denotes  the  predictive  CDF,  the  probability  integral  transform  is  simply 
the  value  F(x)  at  the  verification  x,  a  number  between  0  and  1.  For  the  Hope  Airport 
forecast  in  Figure  2(b),  for  instance,  the  PIT  value  was  0.15.  Rosenblatt  (1952)  studied 


the  probability  integral  transform,  and  Dawid  (1984)  proposed  its  use  in  the  assessment 
of  probabilistic  forecasts.  The  PIT  histogram  -  that  is,  the  histogram  of  the  PIT  values 
-  is  a  commonly  used  tool  in  the  econometric  literature  on  probabilistic  forecasting 
(see,  for  instance,  Weigend  and  Shi  2000).  Its  interpretation  is  the  same  as  that  of  the 
verification  rank  histogram:  Calibrated  probabilistic  forecasts  yield  PIT  histograms  that 
are  close  to  uniform,  while  under  dispersive  forecasts  result  in  U-shaped  PIT  histograms. 

How  can  ensembles  and  PDF  forecasts  be  fairly  compared?  An  ensemble  provides 
a  finite,  typically  small,  number  of  values  only,  while  PDF  forecasts  give  continuous 
statements  of  uncertainty;  so  this  seems  difficult.  There  are  two  natural  approaches  to  a 
fair  comparison,  using  either  the  verification  rank  histogram  or  the  PIT  histogram.  To 
obtain  an  m-member  ensemble  from  a  PDF  forecast,  take  the  CDF  quantiles  at  levels 
,  for  i  =  1, ...  ,  m.  The  verification  rank  histogram  can  then  be  formed  in  the  usual 
way.  To  obtain  a  PIT  histogram  from  an  ensemble,  fit  a  PDF  to  each  ensemble  forecast, 
as  proposed  by  Deque  et  al.  (1994),  Wilson  et  al.  (1999),  and  Grimit  and  Mass  (2004). 
The  standard  ensemble  smoothing  approach  of  Grimit  and  Mass  (2004)  fits  a  normal 
distribution  with  mean  equal  to  the  ensemble  mean  and  variance  equal  to  the  ensem¬ 
ble  variance.  The  PIT  value  is  then  computed  on  the  basis  of  the  fitted  Gaussian  CDF. 
Wilks  (2002)  proposed  to  smooth  forecast  ensembles  by  fitting  mixtures  of  Gaussian  dis¬ 
tributions,  an  approach  that  allows  for  multimodal  forecast  PDFs.  Multimodality  may 
indeed  be  an  issue  for  larger  ensembles.  For  smaller  ensembles,  such  as  the  University 
of  Washington  ensemble,  standard  ensemble  smoothing  using  a  single  normal  density 
suffices. 

In  addition  to  showing  verification  rank  histograms  and  PIT  histograms,  we  report 
the  coverage  of  the  66 1%  central  prediction  interval;  we  chose  this  interval,  because  the 
range  of  a  five-member  ensemble  provides  such.  Finally,  to  assess  sharpness,  we  consider 
the  average  width  of  the  66 1%  prediction  intervals.  For  a  five-member  ensemble,  this  is 
just  the  average  ensemble  range. 

For  Gaussian  predictive  PDFs,  the  average  width  of  the  100  x  (1  —  a)%  prediction 
intervals  is 

2*i-§  s  (3) 

where  £i-|  denotes  the  (1  —  2)  quantile  of  the  normal  distribution  with  mean  0  and 
variance  1,  respectively,  and  where  s  stands  for  the  average  standard  deviation  of  the 
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predictive  PDFs.  For  instance,  Table  2  below  shows  that  the  average  width  of  the  central 
66|%  prediction  intervals  for  MCE-EMOS  forecasts  of  sea  level  pressure  is  4.747.  From 
(3)  with  a  =  |  we  find  that  s  =  2.45.  Using  again  (3),  the  average  width  of  the  50% 
and  90%  central  prediction  intervals  is  3.31  and  8.08,  respectively. 

2.2  Scoring  rules 

Scoring  rules  for  the  verification  of  deterministic-style  forecasts  are  well-known  and  have 
been  widely  used  in  forecast  assessment.  If  /i*  denotes  a  deterministic-style  forecast  and 
Hi  is  the  verification,  the  mean  absolute  error  (MAE)  is  defined  as 

1  n 

MAE  =  /hi 

n  U 

where  the  sum  is  taken  over  the  test  data.  A  related  error  measure  is  the  mean-square 
error  (MSE),  defined  by 

1  n 

MSE=-^(2/4-/q)2. 

n  U 

The  root-mean-square  error  (RMSE)  is  the  square  root  of  the  MSE  and  has  the  advantage 
of  being  recorded  in  the  same  unit  as  the  verifications. 

We  also  consider  two  scoring  rules  for  the  assessment  of  predictive  PDFs,  the  contin¬ 
uous  ranked  probability  score  (Unger  1985;  Hersbach  2000;  Gneiting  and  Raftery  2004), 
and  the  ignorance  score  (Good  1952;  Roulston  and  Smith  2002).  These  scoring  rules  are 
attractive  in  that  they  address  calibration  as  well  as  sharpness. 

The  continuous  ranked  probability  score  (CRPS)  is  the  integral  of  the  Brier  scores 
at  all  possible  threshold  values  t  for  the  continuous  predictand  (Hersbach  2000;  Toth 
et  al.  2003,  Section  7.5.2).  Specifically,  if  F  is  the  predictive  CDF  and  y  verifies,  the 
continuous  ranked  probability  score  is  defined  as 

crps(F,  y)  =  I  ( F(t )  -  H(t  -  y)f  d t  (4) 

J  —  OO 

where  H(t  —  y)  denotes  the  Heaviside  function  and  takes  the  value  0  when  t  <  y  and  the 
value  1  otherwise.  Applications  of  the  continuous  ranked  probability  score  have  been 
hampered  by  a  lack  of  closed  form  expressions  for  the  associated  integral.  However, 
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when  F  is  the  CDF  of  a  normal  distribution  with  mean  //  and  variance  cr2,  repeated 
partial  integration  in  (4)  shows  that 


crpS(A%,  a\  y)  =  a  (±-±  (2  4  {«—!£)  -  l)  +  2  V -  -Fj  (5) 

where  ip  and  <f>  denote  the  PDF  and  the  CDF,  respectively,  of  the  normal  distribution 
with  mean  0  and  variance  1.  A  key  difference  between  the  ignorance  score  and  the  con¬ 
tinuous  ranked  probability  score  is  that  (5)  grows  linearly  in  the  normalized  prediction 
error,  z  —  (y  —  fij/cr,  while  (8)  grows  quadratically  in  z.  Hence,  the  ignorance  score  as¬ 
signs  harsh  penalties  to  particularly  poor  probabilistic  forecasts,  and  can  be  exceedingly 
sensitive  to  outliers  and  extreme  events  (Weigend  and  Shi  2000;  Gneiting  and  Raftery 
2004).  This  will  become  apparent  in  Tables  5  and  7  below.  Returning  to  the  continuous 
ranked  probability  score,  we  note  from  (4)  that  the  average  score 

1  n 

CRPS  =  -  53  crps(Fj,  Hi)  (6) 

n  i= i 

reduces  to  the  MAE  if  each  Ft  is  a  deterministic-style  forecast.  For  this  and  other 
reasons,  the  CRPS  can  be  interpreted  as  a  probabilistic  version  of  the  MAE. 

The  ignorance  score  is  the  negative  of  the  logarithm  of  the  predictive  density  /  at 
the  verifying  value  y,  that  is,  for  a  single  PDF  forecast, 

ign(/,?/)  =  —log  f(y).  (7) 

Roulston  and  Smith  (2002)  provide  an  interesting  information  theoretic  perspective  on 
the  ignorance  score.  In  the  case  of  a  normal  predictive  PDF  with  mean  /i  and  variance 
a2,  we  have 

ign(A f((i,  a2),y )  =  ^  ln(2vru2)  +  ^ (8) 
and  the  average  ignorance  is 

IGN  =  i  t V>)  ='-±(1 M2-?)  +  (^jf)  ■  (9) 

When  interpreting  improvements  in  the  IGN  score,  it  is  absolute  rather  than  relative 
changes  that  are  relevant.  An  improvement  of  A  in  the  IGN  score  corresponds  to  an 
increase  in  the  predictive  PDF  at  the  verifying  values  by  a  factor  of  eA. 
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Both  CRPS  and  IGN  are  negatively  oriented  scores,  in  that  a  smaller  value  is  better, 
and  both  scores  are  proper,  meaning  that  they  reward  honest  assessments.  We  report 
both  scores,  but  in  view  of  the  lack  of  robustness  of  the  ignorance  score,  we  prefer  the 
continuous  ranked  probability  score.  A  more  detailed  discussion  of  scoring  rules  is  given 
in  Gneiting  and  Raftery  (2004). 

2.3  Ensemble  model  output  statistics  and  minimum  CRPS  es¬ 
timation 

We  now  describe  the  ensemble  model  output  statistics  (EMOS)  method.  Suppose  that 
A"  i , . . . ,  Xm  denotes  an  ensemble  of  forecasts  for  a  univariate  weather  quantity  Y,  and 
let  S2  be  the  ensemble  variance.  The  EMOS  predictive  PDF  is  that  of  the  normal 
distribution 

A f  (a  +  biXx  + - b  bmXmi  c  +  dS 2)  .  (10) 

The  EMOS  predictive  mean  a  +  b\X\  +  -  ■  •- \-bmXm  is  an  optimal  bias-corrected  weighted 
average  of  the  ensemble  member  forecasts.  It  provides  a  highly  accurate  deterministic- 
style  forecast.  The  EMOS  predictive  variance  c+dS 2  is  a  linear  function  of  the  ensemble 
spread.  The  regression  coefficients  bi, . . . ,  bm  in  (10)  refiect  the  individual  member  model 
skill.  Stefanova  and  Krishnamurti  (2002)  argue  similarly  in  a  superensemble  context, 
but  they  do  not  constrain  the  regression  coefficients  b\, ... ,  brn  to  be  nonnegative.  The 
variance  coefficients  c  and  d  can  be  interpreted  in  terms  of  the  ensemble  spread  and 
the  skill  of  the  ensemble  mean  forecast.  All  else  equal,  larger  values  of  the  coefficient  d 
suggest  a  more  pronounced  spread-skill  relationship.  If  spread  and  error  are  independent 
of  each  other,  the  coefficient  d  will  be  estimated  as  negligibly  small.  Hence,  EMOS  is 
robust,  in  the  sense  that  it  adapts  to  the  presence  as  well  as  to  the  absence  of  significant 
spread-error  correlation. 

A  classical  technique  for  estimating  the  coefficients  a,  b\, . . . ,  bm,  c,  and  d  from  train¬ 
ing  data  is  maximum  likelihood  (Wilks  1995,  Section  4.7).  The  likelihood  function  is 
defined  as  the  probability  of  the  training  data  given  the  coefficients,  viewed  as  a  function 
of  the  coefficients.  In  practice,  it  is  more  convenient  to  maximize  the  logarithm  of  the 
likelihood  function,  for  reasons  of  both  algebraic  simplicity  and  numerical  stability.  The 
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log-likelihood  function  for  the  statistical  model  (10)  is 
t  (a;  &i, . . .  ,6m;  c;  d) 


(11) 


1  /  rC 

=  -  -  f  A;  log(2vr)  +  ^ 
V  2=1 


(Fj  —  (a  +  &iXj!  +  •  •  •  +  bmXim))- 


+  I]  log(c  +  dS?) 
1=1  / 


c+dSf 

where  the  sum  is  taken  over  the  training  data;  here,  Xu , . . . ,  Xirn  denote  the  ith  ensemble 
forecast  in  the  training  set,  Sf  denotes  its  variance,  and  Yt  denotes  the  ith  verification, 
respectively.  Strictly  speaking,  (11)  is  the  log-likelihood  function  under  the  assumption 
of  independence.  Note  that  the  log-likelihood  (11)  is  essentially  the  negative  of  the 
ignorance  score  (9),  but  is  applied  to  the  training  data  rather  than  the  test  data.  Hence, 
maximum  likelihood  estimation  is  equivalent  to  minimizing  the  ignorance  score  for  the 
training  data. 

This  observation  suggests  a  general  estimation  strategy:  Pick  a  scoring  rule  that  is 
relevant  to  the  problem  at  hand,  express  the  score  for  the  training  data  as  a  function 
of  the  coefficients,  and  optimize  that  function  with  respect  to  the  coefficient  values.  We 
take  scoring  rules  to  be  negatively  oriented,  so  a  smaller  value  is  better,  and  we  minimize 
the  training  score.  For  positively  oriented  scoring  rules,  we  would  maximize  the  training 
score.  Such  an  approach  is  formally  equivalent  to  minimum  contrast  estimation  (MCE), 
a  technique  that  has  been  studied  in  the  theoretical  statistics  literature  (Pfanzagl  1969, 
Birge  and  Massart  1993).  The  minimum  score  approach  can  also  be  interpreted  within 
the  framework  of  robust  M-estimation  (Huber  1964;  Huber  1981,  Section  3.2)  and  forms 
a  special  case  thereof,  in  that  the  function  to  be  optimized  derives  from  a  strictly  proper 
scoring  rule  (Gneiting  and  Raftery  2004).  A  more  detailed  methodological  and  the¬ 
oretical  discussion  is  beyond  the  scope  of  this  paper.  However,  we  compared  EMOS 
PDF  forecasts  estimated  by  MCE  with  the  continuous  ranked  probability  score,  as  de¬ 
scribed  below,  to  EMOS  PDF  forecasts  estimated  by  maximum  likelihood.  The  former 
performed  clearly  better:  the  predictive  PDFs  were  better  calibrated,  and  they  were 
sharper.  This  comparison  is  summarized  in  Table  2.  As  a  rule  of  thumb,  it  seems  that 
predictive  PDFs  estimated  by  maximum  likelihood  tend  to  be  over  dispersive,  resulting 
in  unnecessarily  wide  prediction  intervals  that  have  higher  than  nominal  coverage,  and 
in  inverted  U-shaped  PIT  histograms.  This  latter  shape  is  also  seen  in  Figures  4  and  5  of 
Weigend  and  Shi  (2000),  who  estimate  predictive  densities  by  the  maximum  likelihood 
method  in  the  form  of  the  EM  algorithm. 
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Table  2:  Comparison  of  EMOS  predictive  PDFs  obtained  by  maximum  likelihood  es¬ 
timation  (MLE-EMOS)  and  minimum  CRPS  estimation  (MCE- EMOS),  respectively. 
The  results  are  for  the  test  data,  region,  and  40-day  sliding  training  period  described  in 
Section  3. 


Score  Score  66 1%  Prediction 

Interval 

Average 


MAE 

RMSE 

CRPS 

IGN 

Coverage 

Width 

Sea  level  pressure 

MLE-EMOS 

1.958 

2.496 

1.391 

2.323 

69.41 

4.931 

MCE-EMOS 

1.953 

2.487 

1.389 

2.326 

67.61 

4.747 

Surface  temperature 

MLE-EMOS 

2.239 

2.914 

1.614 

2.490 

72.69 

5.909 

MCE-EMOS 

2.230 

2.906 

1.606 

2.488 

68.57 

5.411 

We  argued  in  Section  2.2  that  the  continuous  ranked  probability  score  (CRPS)  is 
a  more  robust  and  therefore  more  appropriate  scoring  rule  than  the  ignorance  score. 
This  suggests  the  use  of  the  continuous  ranked  probability  score  in  minimum  contrast 
estimation;  and  this  might  be  called  minimum  CRPS  estimation.  The  minimum  CRPS 
estimator  finds  the  coefficients  a,  b±, . . .  ,bm,  c,  and  d  in  the  statistical  model  (10)  that 
minimize  the  CRPS  value  for  the  training  data.  Using  (5)  and  (6),  we  express  the 
training  CRPS  as  an  analytic  function  of  the  coefficients,  namely 

r (a;  6m;  c;  d)  =  E(c  +  dSi)  (2 $(Z*)  -  l)  +  2 <p(Zi)  -  -J=)  (12) 

where 

v  _  (h^  —  (a  +  b1Xil  +  •  •  •  +  bmXim )) 

7+dSf 

is  a  standardized  forecast  error,  and  where  p  and  denote  the  PDF  and  CDF,  respec¬ 
tively,  of  a  normal  distribution  with  mean  0  and  variance  1.  We  find  the  coefficient  values 
that  minimize  (12)  numerically,  using  the  Broyden-Fletcher-Goldfarb-Shanno  algorithm 
(Press  et  al.  1992,  Section  10.7)  as  implemented  in  the  R  language  and  environment 
(www.cran.r-project.org/).  The  optimization  algorithm  requires  initial  values,  and  one 
way  of  specifying  them  is  by  least  squares  estimation  for  the  standard  multiple  linear 
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regression  model  (1).  In  the  sliding  window  implementation  of  Section  3,  we  use  the 
previously  estimated  EMOS  coefficients  as  initial  values  in  the  subsequent  optimization 
problem. 

How  do  we  constrain  the  coefficients  bi, . . .  ,b,m,  c,  and  d  to  be  nonnegative?  The 
nonnegativity  of  the  variance  term  c  is  not  an  issue.  To  enforce  the  nonnegativity  of  d, 
we  set  d  =  d2  and  optimize  over  5]  this  turned  out  to  be  numerically  stable.  To  enforce 
nonnegative  regression  coefficients,  we  proceed  stepwise.  We  first  find  the  unconstrained 
minimum  of  the  CRPS  value  (12).  If  all  estimated  regression  coefficients  are  nonnegative, 
the  EMOS  model  is  complete.  If  one  or  more  of  the  regression  coefficients  are  negative, 
we  set  these  to  zero,  and  we  minimize  the  CRPS  value  (12)  under  that  constraint.  We 
also  re-compute  the  ensemble  variance,  using  only  the  ensemble  members  that  remain 
in  the  regression  equation,  and  we  subsequently  use  the  re-computed  ensemble  spread. 
This  procedure  is  iterated  until  all  estimated  regression  coefficients  are  nonnegative. 

Table  3  illustrates  this  algorithm  for  predictions  on  25  May  2000,  the  day  on  which 
the  Hope  Airport  forecast  in  Figure  2  was  issued.  The  initial,  unconstrained  minimiza¬ 
tion  of  the  CRPS  value  (12)  uses  the  EMOS  coefficients  from  the  previous  fit  as  initial 
values.  This  results  in  a  negative  coefficient  for  the  third  ensemble  member  model,  the 
ETA-MM5  forecast.  We  set  this  coefficient  to  zero  and  proceed  with  the  constrained 
minimization,  resulting  in  a  negative  weight  for  the  NGM-MM5  forecast.  The  final 
EMOS  equation  uses  only  one  of  the  three  ensemble  members  initialized  with  NCEP 
models,  namely  the  AVN-MM5  forecast,  along  with  the  GEM-MM5  forecast,  and  the 
NOGAPS-MM5  forecast.  The  EMOS  weights  reflect  the  relative  skill  of  the  ensemble 
member  models  during  the  40-day  training  period.  Indeed,  the  bias-corrected  AVN- 
MM5,  GEM-MM5,  and  NOGAPS-MM5  forecasts  had  a  smaller  training  RMSE  than  the 
bias-corrected  ETA-MM5  and  NGM-MM5  forecasts.  The  AVN-MM5,  ETA-MM5,  and 
NGM-MM5  forecasts  were  initialized  by  NCEP  models,  and  they  were  highly  collinear. 
For  the  training  period,  the  pair  correlation  coefficients  within  this  group  ranged  from 
0.93  to  0.97.  EMOS  picked  and  retained  the  most  skillful  among  the  three  collinear  fore¬ 
casts.  The  correlation  coefficients  between  NCEP-  and  non  NCEP-initialized  member 
model  forecasts  were  also  high,  but  they  reached  at  most  0.92.  The  estimated  variance 
coefficient  d  turned  out  to  be  negligibly  small,  thereby  indicating  a  weak  spread-skill 
relationship  during  the  training  period.  Indeed,  the  correlation  coefficient  between  the 
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Table  3:  Minimum  CRPS  estimation  of  the  EMOS  coefficients  for  the  Hope  Airport 
forecast  PDF  in  Figure  2.  The  regression  coefficients  bi, . . . ,  b§  correspond  to  the  AVN- 
MM5,  GEM-MM5,  ETA-MM5,  NGM-MM5,  and  N0GAPS-MM5  forecast,  respectively. 


a 

h 

b2 

b3 

bA 

b§ 

c 

d 

Initial  values 

126.31 

0.31 

0.34 

0.00 

0.00 

0.24 

6.22 

0.00 

First  stage 

131.27 

0.37 

0.36 

-0.17 

0.02 

0.29 

5.77 

0.00 

Second  stage 

143.23 

0.35 

0.33 

— 

-0.09 

0.28 

5.81 

0.00 

EMOS 

130.34 

0.31 

0.31 

— 

— 

0.25 

5.88 

0.00 

ensemble  spread  and  the  squared  error  of  the  ensemble  mean  forecast  was  only  0.11 
for  the  40-day  training  period  that  we  used,  as  compared  to  0.33  for  the  entire  period, 
January-June  2000.  The  resulting  EMOS  predictive  PDF  is  Gaussian,  and  is  straight¬ 
forward  to  simulate  from.  An  alternative,  and  likely  preferable,  way  of  forming  an 
m-member  ensemble  from  the  predictive  PDF  is  by  taking  the  forecast  quantiles  at  level 
^-j-,  for  i  —  1, . . .  ,  m,  respectively.  In  this  way,  ensembles  of  any  size  can  be  obtained, 
and  in  this  sense,  EMOS  can  be  viewed  as  a  dressing  method  (Roulston  and  Smith 
2002). 

It  is  worth  pointing  out  that  the  EMOS  model  (10)  can  be  estimated  under  further 
constraints.  The  general  formulation  requires  that  the  ensemble  members  come  from 
individually  distinguishable  sources.  This  is  true  for  the  University  of  Washington  en¬ 
semble,  a  multi-analysis,  mesoscale,  short-range  ensemble,  and  also  for  poor  person’s 
and  multi-model  ensembles.  If  the  linear  regression  is  based  on  the  ensemble  mean  only, 
which  constrains  the  regression  coefficients  b\  —  ■■■  =  b.m  in  (10)  to  be  equal,  EMOS 
can  be  applied  to  essentially  all  ensemble  systems,  including  perturbed  observations, 
singular  vector,  and  bred  ensembles.  Jewson  et  al.  (2003)  applied  such  an  approach  to 
the  synoptic  ECMWF  ensemble,  using  maximum  likelihood  estimation.  However,  they 
did  not  report  out  of  sample  forecasts,  and  consequently  neither  verification  scores  nor 
rank  histograms.  For  the  University  of  Washington  ensemble,  the  general  formulation 
seems  preferable.  In  the  situation  of  Table  2,  constraining  the  regression  coefficients  in 
(10)  to  be  equal,  that  is,  using  the  ensemble  mean  only,  results  in  MAE,  RMSE,  and 
CRPS  scores  up  to  7%  worse,  as  compared  to  the  full  formulation. 
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3  Results  for  the  University  of  Washington  ensem¬ 
ble  over  the  Pacific  Northwest 

We  now  give  the  results  of  applying  EMOS  to  48-hour  forecasts  of  sea  level  pressure 
and  surface  temperature  over  the  northwestern  United  States  and  British  Columbia, 
using  Phase  I  of  the  University  of  Washington  ensemble  described  by  Grimit  and  Mass 
(2002).  The  University  of  Washington  ensemble  system  is  a  mesoscale,  short-range 
ensemble  based  on  the  Fifth-generation  Pennsylvania  State  University  -  National  Center 
for  Atmospheric  Research  Mesoscale  Model  (PSU-NCAR  MM5)  and  forms  an  integral 
part  of  the  Pacific  Northwest  regional  environmental  prediction  effort  (Mass  et  al.  2003). 
The  ensemble  used  a  0000  UTC  cycle  and  was  in  operation  on  102  days  between  12 
January  2000  and  30  June  2000;  it  is  described  in  Table  1.  During  this  period,  there  were 
16,015  and  56,489  verifying  observations  of  sea  level  pressure  and  surface  temperature, 
respectively.  Model  forecast  data  at  the  four  grid  points  surrounding  each  observation 
were  bilinearly  interpolated  to  the  observation  site  (Grimit  and  Mass  2002).  When  we 
talk  of  a  40-day  training  period,  say,  we  refer  to  the  most  recent  40  days  for  which 
ensemble  output  and  verifying  observations  were  available.  In  terms  of  calendar  days, 
this  period  typically  corresponds  to  more  than  40  days. 

3.1  Length  of  training  period 

What  training  period  should  be  used  for  estimating  the  EMOS  regression  coefficients 
and  variance  parameters?  There  is  a  trade-off  here.  Shorter  training  periods  allow  to 
adapt  rapidly  to  seasonally  varying  model  biases,  changes  in  the  performance  of  the 
ensemble  member  models,  and  changes  in  environmental  conditions.  On  the  other  hand, 
longer  training  periods  reduce  the  statistical  variability  in  the  estimation  of  the  EMOS 
coefficients.  We  considered  training  periods  of  19,  20,  . . . ,  62  days  for  forecasts  of  sea 
level  pressure.  For  comparability,  the  same  test  set  was  used  in  assessing  all  the  training 
periods,  that  is,  the  Erst  63  days  on  which  the  ensemble  was  operating  were  not  included 
in  the  test  data.  The  unit  used  for  the  sea  level  pressure  forecasts  is  the  millibar  (mb). 

The  results  of  this  experiment  are  summarized  in  Figure  3.  Figures  3(a)  and  3(b) 
show  the  mean  absolute  error  (MAE)  and  root-mean-square  error  (RMSE)  of  the  deter¬ 
ministic-style  EMOS  forecasts,  respectively.  These  decrease  sharply  for  training  periods 
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(a)  MAE  of  EMOS  Deterministic  Forecast 


(b)  RMSE  of  EMOS  Deterministic  Forecast 


(e)  Coverage  of  66.7%  Prediction  Intervals  (f)  Average  Width  of  66.7%  Prediction  Intervals 
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Figure  3:  Comparison  of  training  period  lengths  for  forecasts  of  sea  level  pressure  over 
the  Pacific  Northwest:  (a)  MAE  of  EMOS  deterministic-style  forecasts,  (b)  RMSE 
of  EMOS  deterministic-style  forecasts,  (c)  Continuous  ranked  probability  score,  (d) 
Ignorance  score,  (e)  Coverage  of  66|%  prediction  intervals,  (f)  Average  width  of  66|% 
prediction  intervals. 
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less  than  30  days,  stay  about  constant  for  training  periods  between  30  and  45  days, 
and  increase  thereafter.  Figures  3(c)  and  3(d)  show  the  continuous  ranked  probability 
score  (CRPS)  and  the  ignorance  score  (IGN).  The  patterns  are  similar  to  those  for  the 
MAE  and  the  RMSE.  The  coverage  of  EMOS  66 1%  prediction  intervals  is  shown  in 
Figure  3(e).  Training  periods  under  30  days  seem  to  result  in  under  dispersive  PDFs, 
but  training  periods  between  30  and  60  days  show  close  to  nominal  coverage.  Figure  3(f) 
shows  the  average  widths  of  the  66 1%  prediction  intervals.  The  average  width  increases 
with  the  length  of  the  training  period,  but  is  about  constant  for  training  periods  between 
30  and  40  days. 

To  summarize  these  results,  there  appear  to  be  substantial  gains  in  increasing  the 
training  period  beyond  30  days.  As  the  training  period  increases  beyond  45  days,  the 
skill  of  the  probabilistic  forecasts  declines  slowly  but  steadily,  presumably  as  a  result  of 
seasonally  varying  model  biases.  In  view  of  our  goal  of  maximizing  sharpness  subject 
to  calibration,  we  chose  a  40-day  training  period.  This  worked  well  for  temperature 
forecasts,  too.  However,  distinct  training  periods  might  work  best  for  distinct  variables, 
forecast  horizons,  time  periods,  and  regions.  Ideally,  we  would  include  training  data 
from  previous  years  to  address  seasonal  effects.  Further  research  in  this  direction  is 
desirable  as  multi-year  runs  of  stable  mesoscale  ensembles  become  available. 

3.2  Sea  level  pressure  forecasts 

We  now  give  the  results  for  EMOS  forecasts  of  sea  level  pressure,  using  a  40-day  sliding 
training  period  and  the  same  test  set  that  was  used  to  compare  the  different  training 
periods.  We  also  summarize  the  results  for  the  bias-corrected  ensemble  member  forecasts 
and  for  a  climatological  forecast.  The  bias-corrected  ensemble  member  forecasts  were 
obtained  by  simple  linear  regression,  estimated  on  the  same  40-day  sliding  training 
period.  The  deterministic-style  climatological  forecast  was  the  average  sea  level  pressure 
among  the  verifying  observations  in  the  training  period,  and  the  climatological  predictive 
PDF  was  obtained  by  fitting  a  normal  PDF  to  the  training  data. 

Figure  4  shows  the  estimates  of  the  EMOS  coefficients,  as  they  evolve  over  the  test 
period.  The  estimated  intercept  in  the  multiple  linear  regression  equation  is  shown  in 
Figure  4(a).  Figures  4(b),  (c),  (d),  (e),  and  (f)  show  the  EMOS  weights  for  the  five 
ensemble  member  models,  respectively.  The  weights  for  the  AVN-MM5,  CMC-MM5, 
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(a)  Intercept  (b)  AVN-MM5  Forecast  (c)  GEM-MM5  Forecast  (d)  ETA-MM5  Forecast 
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Figure  4:  Coefficient  estimates  for  EMOS  forecasts  of  sea  level  pressure  over  the  Pacific 
Northwest,  (a)  Intercept,  (b),  (c),  (d),  (e),  and  (f):  Member  model  weights,  (g)  and 
(h):  Variance  terms  c  and  d. 

and  NOGAPS-MM5  forecasts  were  consistently  substantial,  and  the  weights  for  the 
ETA-MM5  and  NGM-MM5  forecasts  were  consistently  negligible  or  zero.  This  is  a 
collinearity  effect,  very  much  like  the  effect  described  in  Section  2.3.  EMOS  retains 
only  one  of  the  three  highly  collinear  ensemble  member  models  that  were  initialized  by 
NCEP  analyses  and  picks  the  most  skillful  of  them,  namely  the  AVN-MM5  forecast. 
Figures  4(f)  and  4(g)  show  the  estimated  variance  coefficients  c  and  d,  respectively.  The 
estimates  of  c  decreased  during  the  test  period,  thereby  indicating  improved  ensemble 
skill  or  improved  atmospheric  predictability,  or  both.  The  values  of  d  were  small  but 
mostly  nonzero.  The  increase  toward  the  end  of  the  test  period  suggests  a  strengthening 
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of  the  spread-skill  relationship. 

Table  4  provides  summary  measures  of  deterministic-style  forecast  accuracy.  Among 
the  raw  ensemble  member  models,  the  AVN-MM5  forecast  performed  best.  Bias  cor¬ 
rection  resulted  in  a  reduction  of  the  RMSE  for  the  ensemble  member  model  forecasts 
between  4%  and  17%.  The  improvement  was  most  pronounced  for  the  NOGAPS-MM5 
forecast.  The  AVN-MM5,  CMC-MM5,  and  NOGAPS-MM5  forecasts  were  more  accu¬ 
rate  than  the  ETA-MM5  and  NGM-MM5  forecasts.  This  confirms  our  interpretation 
of  the  EMOS  weights  in  terms  of  the  individual  member  model  skill.  The  ensemble 
mean  forecast  performed  considerably  better  than  any  of  the  ensemble  member  mod¬ 
els.  However,  the  bias-corrected  AVN-MM5  forecast  and  the  bias-corrected  NOGAPS- 
MM5  forecast  were  more  accurate  than  the  mean  of  the  bias-corrected  ensemble.  The 
deterministic-style  EMOS  forecast  clearly  performed  best  and  had  RMSE  9%  and  8%  less 
when  compared  to  the  mean  of  the  raw  ensemble  and  to  the  mean  of  the  bias-corrected 
ensemble,  respectively.  The  results  in  terms  of  the  mean  absolute  error  (MAE)  were 
similar. 

Table  5  turns  to  summary  measures  of  probabilistic  forecast  skill.  The  climatological 
predictive  PDFs  showed  the  correct  coverage,  but  they  were  too  spread  out  to  be  com¬ 
petitive.  The  bias-corrected  ensemble  shows  reduced  ensemble  spread,  but  is  even  more 
underdispersive  than  the  raw  ensemble.  The  EMOS  prediction  intervals  show  accurate 
coverage.  The  continuous  ranked  probability  score  (CRPS)  and  the  ignorance  score 
(IGN)  were  computed  as  described  in  Section  2.2,  using  standard  ensemble  smooth¬ 
ing  for  the  raw  and  bias-corrected  ensemble,  respectively.  The  CRPS  score  can  also 
be  computed  directly,  by  using  the  empirical  ensemble  CDF,  which  takes  the  values 
0,|,...,|,1,  with  jumps  at  the  ensemble  member  forecasts.  This  gave  somewhat  higher 
CRPS  values  of  1.69  and  1.72  for  the  raw  ensemble  and  for  the  bias-corrected  ensemble, 
respectively.  The  EMOS  predictive  PDFs  had  by  far  the  best  scores  among  the  different 
forecasts.  When  compared  to  the  bias-corrected  ensemble,  EMOS  reduced  the  CRPS 
score  by  16%.  EMOS  reduced  the  IGN  score  by  3.68  points,  indicating  that  the  predic¬ 
tive  PDF  of  verifying  observations  increased  by  a  factor  of  40.  The  EMOS  prediction 
intervals  were  not  much  wider  than  prediction  intervals  obtained  from  the  raw  ensem¬ 
ble.  A  more  detailed  analysis  shows,  perhaps  surprisingly,  that  in  27%  of  the  forecasts 
the  EMOS  66|%  prediction  interval  was  shorter  than  the  range  of  the  five-member  raw 
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Table  4:  Comparison  of  deterministic-style  forecasts  of  sea  level  pressure  over  the  Pacific 
Northwest.  The  climatological,  bias-corrected,  and  EMOS  forecasts  were  trained  on  a 
sliding  40-day  period. 


MAE 

RMSE 

Climatological  forecast 

4.72 

5.83 

AVN-MM5 

2.20 

2.90 

GEM-MM5 

2.35 

3.00 

ETA-MM5 

2.50 

3.25 

NGM-MM5 

2.70 

3.40 

NOGAPS-MM5 

2.50 

3.21 

AVN-MM5  bias-corrected 

2.10 

2.68 

GEM-MM5  bias-corrected 

2.24 

2.88 

ETA-MM5  bias-corrected 

2.37 

3.14 

NGM-MM5  bias-corrected 

2.48 

3.23 

NOGAPS-MM5  bias-corrected 

2.10 

2.66 

Mean  of  raw  ensemble 

2.11 

2.73 

Mean  of  bias-corrected  ensemble 

2.08 

2.69 

EMOS  forecast 

1.95 

2.49 

ensemble.  In  9%  of  the  forecasts,  the  EMOS  66|%  prediction  interval  was  shorter  than 
the  range  of  the  bias-corrected  ensemble. 

The  verification  rank  histograms  for  the  raw  ensemble,  the  bias-corrected  ensemble, 
and  the  EMOS  ensemble  are  shown  in  Figure  5.  The  EMOS  ensemble  was  much  better 
calibrated  than  the  raw  ensemble  or  the  bias-corrected  ensemble.  Its  rank  histogram  is 
close  to  being  uniform  but  not  quite  uniform;  indeed,  the  latter  was  not  to  be  expected. 
Sea  level  pressure  is  a  synoptic  variable  with  strong  spatial  correlation  throughout  the 
ensemble  domain,  and  there  were  only  39  days  in  the  evaluation  period.  The  probability 
integral  (PIT)  histograms  in  Figure  6  accentuate  the  underdispersion  in  the  raw  ensemble 
and  the  bias-corrected  ensemble. 

3.3  Temperature  forecasts 

We  now  summarize  the  results  for  forecasts  of  surface  temperature,  a  case  of  primary 
interest  to  the  public  (Murphy  and  Winkler  1979).  The  2-m  temperature  forecasts  were 
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Table  5:  Comparison  of  predictive  PDFs  for  sea  level  pressure  over  the  Pacific  Northwest. 
The  bias-corrected  ensemble  and  the  EMOS  forecasts  were  trained  on  a  sliding  40-day 
period. 


66 1%  Prediction 
Interval 

Score 

Coverage 

Average 

Width 

CRPS 

IGN 

Climatological  forecast 

67.0 

11.83 

3.32 

3.19 

Raw  ensemble 

53.9 

3.93 

1.61 

4.84 

Bias-corrected  ensemble 

40.7 

2.77 

1.66 

6.01 

EMOS  forecast 

67.6 

4.75 

1.39 

2.33 

obtained  as  an  average  of  the  predicted  lowest  sigma  level  temperature  and  the  predicted 
ground  temperature.  Similar  to  the  sea  level  pressure  forecasts,  we  used  a  sliding  40- 
day  training  period,  and  we  considered  the  same  region  and  the  same  test  period.  We 
omit  the  results  for  the  climatological  forecast  which  is  even  less  competitive  than  for  sea 
level  pressure,  given  seasonal  and  topographic  effects.  The  unit  used  for  the  temperature 
forecasts  is  degrees  Kelvin. 

Figure  7  shows  how  the  estimates  of  the  EMOS  coefficients  evolve  over  the  test  period. 
Figure  7(a)  shows  the  estimated  intercept  which  is  consistently  small  and  negative. 
Figures  7(b),  (c),  (d),  (e),  and  (f)  show  the  estimated  EMOS  weights,  respectively.  The 
weights  for  the  AVN-MM5  forecast  reached  a  maximum  of  0.61  and  were  consistently 
the  highest  among  the  five  ensemble  member  models.  The  weights  for  the  ETA-MM5 
forecast  and  for  the  NOGAPS-MM5  forecast  were  smaller  but  still  substantial;  those  for 
the  NGM-MM5  forecast  were  generally  negligible  or  zero;  and  the  weights  for  the  GEM- 
MM5  forecast  were  initially  negligible,  before  increasing  to  substantial  levels.  These 
results  can  be  interpreted  in  terms  of  collinearity  and  ensemble  member  model  skill. 
The  correlation  coefficient  between  the  ETA-MM5  and  the  NGM-MM5  forecasts  was 
the  highest  among  the  forecast  pairs.  To  avoid  collinearity,  EMOS  retained  only  one  of 
them.  The  AVN-MM5  forecast  was  the  most  accurate  member  model  and  received  the 
highest  EMOS  weights.  Figures  7(f)  and  7(g)  show  the  estimated  variance  coefficients  c 
and  d,  respectively.  The  estimates  of  d  were  consistently  substantial. 
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Figure  5:  Verification  rank  histograms  for  ensemble  forecasts  of  sea  level  pressure  over 
the  Pacific  Northwest:  (a)  Raw  ensemble,  (b)  Bias-corrected  ensemble,  (c)  EMOS 
ensemble. 


Table  6  confirms  that  the  AVN-MM5  forecast  was  the  most  accurate  among  the  en¬ 
semble  member  forecasts,  both  before  and  after  bias  correction.  Bias  correction  resulted 
in  percentage  improvements  in  the  RMSE  of  the  ensemble  member  forecasts  between  4% 
and  14%,  and  the  NOGAPS-MM5  forecast  showed  the  highest  percentage  improvement. 
The  results  in  terms  of  the  MAE  were  similar.  The  deterministic-style  EMOS  forecast 
was  the  most  accurate,  even  though  the  percentage  improvement  over  the  bias-corrected 
ensemble  was  less  pronounced  than  for  forecasts  of  sea  level  pressure. 

We  now  turn  to  a  discussion  of  probabilistic  forecast  skill.  Table  7  shows  that  the 
bias-corrected  ensemble  was  slightly  better  calibrated  than  the  raw  ensemble.  However, 
both  the  raw  ensemble  and  the  bias-corrected  ensemble  were  strikingly  under  dispersive, 
and  this  was  reflected  in  the  CRPS  and  IGN  scores,  which  were  computed  on  the  basis 
of  standard  ensemble  smoothing.  When  computed  directly  from  the  ensemble  CDF,  the 
CRPS  scores  for  the  raw  ensemble  and  for  the  bias-corrected  ensemble  were  2.13  and 
1.95,  respectively.  The  EMOS  forecast  performed  best,  with  a  CRPS  score  that  was 
15%  lower  than  for  the  bias-corrected  ensemble,  and  an  IGN  score  that  was  13  points 
lower.  The  verification  rank  histograms  and  PIT  histograms  are  shown  in  Figures  8  and 
9.  The  PIT  histograms  accentuate  the  underdispersion  of  the  ensemble  forecasts,  while 
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(a)  Smoothed  Ensemble 


(b)  Bias-Corrected 


(c)  EMOS  Forecast 


Figure  6:  Probability  integral  transform  (PIT)  histograms  for  PDF  forecasts  of  sea  level 
pressure  over  the  Pacific  Northwest:  (a)  Smoothed  ensemble  forecast,  (b)  Bias-corrected 
smoothed  ensemble  forecast,  (c)  EMOS  forecast. 


the  histograms  for  the  EMOS  ensemble  are  close  to  being  uniform. 


4  Discussion 

It  is  well  documented  in  the  literature  that  multiple  regression  or  superensemble  tech¬ 
niques  improve  the  deterministic-style  forecast  accuracy  of  ensembles  systems  (Krish- 
namurti  et  al.  1999,  2000;  Kharin  and  Zwiers  2002).  Regression-based  forecasts  correct 
for  model  biases  and  therefore  are  more  accurate  than  the  ensemble  mean  forecast.  The 
novelty  of  our  ensemble  model  output  statistics  (EMOS)  approach  is  three-fold.  We 
constrain  the  regression  coefficients  to  be  nonnegative,  thereby  allowing  for  a  more  di¬ 
rect  interpretation  of  the  EMOS  coefficients  in  terms  of  ensemble  member  model  skill. 
EMOS  identifies  ensemble  members  whose  relative  contributions  are  negligible,  typically 
as  a  result  of  collinearity.  The  method  ignores  those  members  when  finding  the  EMOS 
predictive  mean,  an  optimal  bias-corrected  weighted  average  of  the  ensemble  member 
forecasts  that  provides  a  highly  accurate  deterministic-style  forecast.  For  estimating 
the  EMOS  coefficients,  we  use  the  novel  method  of  minimum  CRPS  estimation.  Fi¬ 
nally,  we  apply  linear  regression  techniques  to  obtain  full  predictive  PDFs,  rather  than 
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(a)  Intercept  (b)  AVN-MM5  Forecast  (c)  GEM-MM5  Forecast  (d)  ETA-MM5  Forecast 
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Figure  7:  Coefficient  estimates  for  EMOS  forecasts  of  surface  temperature  over  the 
Pacific  Northwest,  (a)  Intercept,  (b),  (c),  (d),  (e),  and  (f):  Member  model  weights,  (g) 
and  (h):  Variance  terms  c  and  d. 

deterministic-style  forecasts,  for  continuous  weather  variables.  The  EMOS  predictive 
PDFs  are  Gaussian,  and  they  take  account  of  the  spread-skill  relationship,  in  that  the 
predictive  variance  is  a  linear  function  of  the  ensemble  spread.  However,  EMOS  adapts 
to  the  absence  of  spread-error  correlation,  by  estimating  the  variance  coefficient  d  as 
negligibly  small.  Monte  Carlo  simulation  from  the  Gaussian  predictive  PDF  is  straight¬ 
forward,  and  forecast  ensembles  of  any  size  can  be  generated.  An  alternative,  and  likely 
preferable,  way  of  forming  an  m-member  ensemble  from  the  predictive  PDF  is  by  taking 
the  forecast  percentiles  at  level  ^-y  x  100%,  for  rn. 

We  applied  the  EMOS  technique  to  sea  level  pressure  and  surface  temperature  fore- 
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Table  6:  Comparison  of  deterministic-style  forecasts  of  surface  temperature  over  the 
Pacific  Northwest.  The  climatological,  bias-corrected,  and  EMOS  forecasts  were  trained 
on  a  sliding  40-day  period. 


MAE 

RMSE 

AVN-MM5 

2.45 

3.15 

GEM-MM5 

2.64 

3.40 

ETA-MM5 

2.52 

3.23 

NGM-MM5 

2.56 

3.28 

NOGAPS-MM5 

2.96 

3.76 

AVN-MM5  bias-corrected 

2.31 

3.00 

GEM-MM5  bias-corrected 

2.48 

3.24 

ETA-MM5  bias-corrected 

2.39 

3.10 

NGM-MM5  bias-corrected 

2.42 

3.13 

NOGAPS-MM5  bias-corrected 

2.50 

3.25 

Mean  of  raw  ensemble 

2.49 

3.18 

Mean  of  bias-corrected  ensemble 

2.28 

2.95 

EMOS  forecast 

2.23 

2.91 

casts  over  the  North  American  Pacific  Northwest  in  Spring  2000,  using  the  University  of 
Washington  mesoscale  ensemble  (Grimit  and  Mass  2002).  The  EMOS  predictions  were 
more  accurate  when  compared  to  the  member  model  forecasts,  bias-corrected  mem¬ 
ber  model  forecasts,  the  ensemble  mean  forecast,  and  the  ensemble  mean  of  the  bias- 
corrected  member  models.  We  also  assessed  the  probabilistic  forecast  skill  of  the  EMOS 
predictive  PDFs.  When  compared  to  the  bias-corrected  ensemble,  EMOS  PDF  forecasts 
of  sea  level  pressure  had  substantially  better  CRPS  and  IGN  scores.  The  EMOS  PDFs 
were  much  better  calibrated  than  the  raw  ensemble  or  the  bias-corrected  ensemble,  and 
they  were  sharp,  in  that  EMOS  prediction  intervals  were  much  shorter  on  average  than 
prediction  intervals  based  on  climatology.  Perhaps  surprisingly,  the  EMOS  forecasts  of 
sea  level  pressure  were  frequently  sharper  than  the  raw  ensemble  forecasts.  With  small 
modifications,  as  explained  in  Section  2.3,  EMOS  applies  to  all  ensemble  systems,  in¬ 
cluding  weather  and  climate,  synoptic-scale,  poor  person’s,  multi-model,  multi-analysis, 
perturbed  observations,  singular  vector,  and  bred  ensembles.  EMOS  can  be  applied  to 
gridded  ensemble  output,  thereby  providing  probabilistic  forecasts  on  a  grid.  The  re- 
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Table  7:  Comparison  of  predictive  PDFs  for  surface  temperature  over  the  Pacific  North¬ 
west.  The  bias-corrected  ensemble  and  the  EMOS  forecasts  were  trained  on  a  sliding 
40-day  period. 


66 1%  Prediction 

Score 

Interval 

Average 

Coverage 

Width 

CRPS 

IGN 

Raw  ensemble 

28.7 

2.55 

2.07 

21.45 

Bias-corrected  ensemble 

31.1 

2.44 

1.89 

15.50 

EMOS  forecast 

68.6 

5.41 

1.61 

2.49 

suiting  forecast  fields  can  be  visualized  in  the  form  of  percentile  maps,  as  in  Figure  5  of 
Raftery  et  al.  (2003).  In  our  experiment,  we  used  observations  to  estimate  the  EMOS 
coefficients,  but  this  could  also  be  done  using  an  analysis. 

Bias  correction  results  in  more  accurate  deterministic-style  forecasts,  and  bias  correc¬ 
tion  reduces  ensemble  spread,  by  pulling  the  individual  member  model  forecasts  towards 
the  verification  mean  (Eckel  2003).  Verification  rank  histograms  typically  become  more 
symmetric  after  bias  correction,  as  in  our  Figure  8,  or  in  Figure  46  of  Eckel  (2003). 
However,  bias  correction  does  not  result  in  improved  calibration,  and  the  need  for  sta¬ 
tistical  post-processing  remains.  We  anticipate  significant  improvements  in  probabilistic 
forecast  skill  through  the  use  of  advanced  bias  correction  schemes,  followed  by  statistical 
post-processing  of  the  bias-corrected  member  model  ensemble.  Further  research  in  this 
direction  is  desirable. 

We  close  with  a  discussion  of  potential  extensions  as  well  as  limitations  of  the  EMOS 
technique.  The  predictive  PDFs  produced  by  the  EMOS  method  are  Gaussian  and 
therefore  unimodal.  This  is  unlikely  to  be  a  disadvantage  for  a  five-member  ensemble, 
such  as  the  University  of  Washington  ensemble  that  we  considered.  However,  larger 
ensembles  occasionally  suggest  multimodal  forecast  PDFs.  The  ensemble  smoothing 
approach  of  Wilks  (2002)  and  the  Bayesian  model  averaging  approach  of  Raftery  et 
al.  (2003)  address  this  issue. 

We  obtained  EMOS  forecasts  of  sea  level  pressure  and  surface  temperature.  These 
are  variables  for  which  the  forecast  error  distributions  are  approximately  Gaussian.  The 


(a)  Raw  Ensemble  (b)  Bias-Corrected  Ensemble  (c)  EMOS  Ensemble 
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Figure  8:  Verification  rank  histograms  for  ensemble  forecasts  of  surface  temperature 
over  the  Pacific  Northwest:  (a)  Raw  ensemble,  (b)  Bias-corrected  ensemble,  (c)  EMOS 
ensemble. 

forecast  error  distributions  for  other  variables,  such  as  precipitation  or  cloud-cover,  are 
unlikely  to  be  close  to  normal.  Wilks  (2002)  proposes  ways  of  transforming  forecast 
ensembles  to  Gaussian  distributions,  and  EMOS  can  be  applied  to  the  transformed  en¬ 
semble.  Another  approach  uses  the  framework  of  generalized  linear  models  (McCullagh 
and  Nelder  1989),  and  this  remains  to  be  explored. 

Our  method  produces  predictive  PDFs  of  continuous  weather  variables  at  a  given 
location,  but  it  does  not  reproduce  the  spatial  correlation  patterns  of  observed  weather 
fields.  Gel  et  al.  (2004)  suggest  a  way  of  creating  ensembles  of  entire  weather  fields, 
each  of  which  honors  the  spatial  correlation  structure  of  verifying  fields.  However,  this 
approach  uses  only  one  numerical  weather  prediction  model  rather  than  an  ensemble  of 
forecasts.  This  method  could  be  combined  with  EMOS  to  yield  calibrated  ensembles 
of  entire  weather  fields,  by  simulating  correlated  error  fields  and  adding  them  to  the 
spatially  varying  predictive  mean  of  the  EMOS  forecasts.  Such  an  approach  could  also 
be  viewed  as  a  dressing  method  (Roulston  and  Smith  2003).  A  different,  somewhat 
simplistic  idea  is  what  might  be  called  a  probability  field  ensemble,  that  is,  an  ensemble  of 
m,  say  m  =  5,  weather  fields  showing  the  percentiles  of  the  EMOS  predictive  distribution 
function  at  the  levels  — (-y  x  100%  for  i  =  1  respectively.  Probability  field 
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Figure  9:  Probability  integral  transform  (PIT)  histograms  for  PDF  forecasts  of  surface 
temperature  over  the  Pacific  Northwest:  (a)  Smoothed  ensemble  forecast,  (b)  Bias- 
corrected  smoothed  ensemble  forecast,  (c)  EMOS  forecast. 

ensembles  do  not  reproduce  the  spatial  correlation  structure  of  observed  weather  fields, 
nor  do  they  take  account  of  dynamical  features.  However,  a  probability  field  ensemble 
could  be  interpreted  as  a  sample  of  equally  likely  weather  fields,  with  respect  to  any 
fixed  location,  which  may  facilitate  the  interpretation,  and  may  foster  the  acceptance 
and  the  use  of  probabilistic  forecasts. 
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